Learning Articles - Data Science


IPython Interpreter And Jupyter Notebooks

Python is an interpreted language: an interpreter runs a program by executing one statement at a time. IPython is an enhanced interpreter designed for both interactive computing and software development, and it encourages an execute-explore workflow instead of the edit-compile-run workflow typical of many other programming languages. It also provides integrated access to the shell and filesystem of the operating system, removing the need to switch between the current session and a terminal. Jupyter is an initiative to design language-agnostic interactive computing tools, and IPython serves as the kernel for using Python with Jupyter. For data analysis, IPython and Jupyter are essential for efficient exploration, interaction, testing, debugging, and iteration. These notes rely on ideas from the respective package documentation, "Python For Data Analysis: Data Wrangling With Pandas, NumPy, And Jupyter", 3rd Edition, by Wes McKinney (creator and developer of pandas) in 2022, and "Python Data Science Handbook: Essential Tools For Working With Data", 2nd Edition, by Jake VanderPlas in 2022.

https://docs.python.org/3/index.html https://ipython.readthedocs.io/en/stable/ https://docs.jupyter.org/en/latest/

IPython Shell System

IPython can be seen as an enhanced Python interpreter which offers additional features relative to the standard Python interpreter and IDLE (Integrated Development And Learning Environment). In more detail, the IPython shell and kernel offer comprehensive object introspection; input history which is persistent across sessions; caching of output results during a session with automatically generated references; extensible tab completion with support for completion of variables, functions, arguments, keywords, and file names; an extensible system of commands for controlling the environment and performing many tasks related to IPython or the operating system; a rich configuration system with the ability to switch between different setups; session logging and reloading; extensible syntax processing for special purpose situations; access to the system shell with a user-extensible alias system; integrated access to debuggers and profilers; and rich display of HTML, images, sounds, videos, and LaTeX. It should also be noted that the IPython shell will usually render text with syntax highlighting for improved readability.

https://ipython.org/

...

Launch the IPython shell from the command line and execute a file to run:

~ $ ipython

~ $ ipython
In [1]: %run file.py

Use object introspection to display general information about a variable or function:

In [1]: variable?

In [1]: function?
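
The ? syntax only works inside IPython, but a rough idea of what it reports can be sketched in plain Python with the standard inspect module. The helper describe and the sample function add below are hypothetical names used for illustration; this is an approximation under those assumptions, not how IPython implements introspection.

```python
import inspect


def describe(obj):
    """Return a short summary loosely similar to what `obj?` displays."""
    lines = [f"Type:      {type(obj).__name__}"]
    if callable(obj):
        try:
            lines.append(f"Signature: {inspect.signature(obj)}")
        except (TypeError, ValueError):
            # Some built-in callables do not expose a signature.
            lines.append("Signature: <unavailable>")
    doc = inspect.getdoc(obj)
    if doc:
        # Keep only the first line of the docstring for brevity.
        lines.append(f"Docstring: {doc.splitlines()[0]}")
    return "\n".join(lines)


def add(a, b=1):
    """Add two numbers."""
    return a + b


print(describe(add))
```

Calling describe on a plain variable such as a list reports its type and the first line of the docstring of its class, which is close in spirit to what variable? shows.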

Search the IPython namespace or scope to show all names matching the wildcard expression:

In [1]: numpy.*load*?

In [1]: pandas.*read*?
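
Outside IPython, a similar wildcard search can be approximated with the standard fnmatch module applied to the names returned by dir. The helper search_names is a hypothetical name, and the standard json module is used here only so that the sketch runs without third-party packages; names starting with an underscore are skipped as a simplifying assumption.

```python
import fnmatch
import json


def search_names(module, pattern):
    """List public attribute names of `module` matching a shell-style wildcard."""
    return sorted(
        name
        for name in dir(module)
        if not name.startswith("_") and fnmatch.fnmatchcase(name, pattern)
    )


# Roughly comparable to typing `json.*load*?` in an IPython session.
print(search_names(json, "*load*"))
```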

Configuration...

...

%timeit

Jupyter Notebook System

A primary component of the Jupyter project is the notebook system. The notebook system provides a means for creating rich and interactive documents by allowing content to be authored in HTML or Markdown alongside source code, data visualizations, and other outputs. A notebook interacts with kernels, which are implementations of the Jupyter computing protocol specific to different programming languages (kernels currently exist for over 40 programming languages). The Python kernel for Jupyter, ipykernel, uses the IPython system for its underlying behaviour. Although notebooks are usually used as local computing environments, they can also be deployed on servers and accessed remotely.

https://jupyter.org/

When the notebook application is launched, a local server is started to host notebooks. A new notebook can then be created by visiting the URL of the server. When a notebook is saved, all of its content, including any evaluated code output, is stored in a self-contained file with the .ipynb extension. The notebook can be edited in the native web-based interface from a browser, or in various development environments which offer additional features, such as Spyder, Visual Studio Code (with extensions for Python and Jupyter), and JupyterLab. It should be noted that many integrated development environments, such as Spyder and Visual Studio Code, can directly open a notebook from an .ipynb file without manually starting a server.
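
To make the idea of a self-contained notebook file more concrete, the sketch below builds a tiny notebook by hand and reads it back. A notebook file is a JSON document; the field names used here follow version 4 of the notebook format as I understand it, and the hand-built dictionary is only an illustration (real notebooks written by Jupyter carry more metadata, and the file name example.ipynb is hypothetical).

```python
import json
import tempfile
from pathlib import Path

# A minimal notebook: one Markdown cell and one code cell with its saved output.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Example notebook"},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": "print(1 + 1)",
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": "2\n"}
            ],
        },
    ],
}

with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "example.ipynb"
    path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    # Reading the file back shows that the evaluated output travels with the code.
    loaded = json.loads(path.read_text(encoding="utf-8"))

cell_types = [cell["cell_type"] for cell in loaded["cells"]]
stored_output = loaded["cells"][1]["outputs"][0]["text"]
print(cell_types, repr(stored_output))
```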

Install Jupyter with JupyterLab as an integrated development environment:

				~ $ pip install ipykernel
				~ $ pip install notebook

				~ $ pip install jupyter

				~ $ pip install jupyterlab

Create a new notebook and start a local server for hosting the notebook:

~ $ jupyter notebook

Configuration...

Screenshots of native, Spyder, VS Code, and JupyterLab

Tuples, Lists, Dictionaries, Sets

Among the built-in data structures, the most frequently used container types are tuples, lists, dictionaries, and sets. A tuple is a fixed-length, immutable sequence of objects: once it is assigned, it is not possible to change which object is stored in each slot (although an object within a slot can itself be modified if it is mutable). A list is a variable-length, mutable sequence of objects which can be modified after it is assigned. A dictionary (known as a hash map or associative array in other programming languages) stores a collection of key-value pairs, where the keys and their associated values are objects. Keys must be hashable, which in practice generally means immutable objects such as scalar types (strings, integers, or floats) or tuples containing only immutable objects. A set is an unordered collection of unique objects, and its elements must be hashable in the same way as dictionary keys. Each type provides operations and methods for tasks such as indexing, concatenating, sorting, finding sizes, counting occurrences, appending, inserting, and removing objects, or performing set operations.

https://docs.python.org/3/library/stdtypes.html https://docs.python.org/3/tutorial/datastructures.html

Examples of frequently used container types including a tuple, list, dictionary, and set:

variable_tuple = (0, 1, 2, 3.1415, "Example", True, False, ("X", 3, None), ["Y", range(10)])

variable_list = [0, 1, 2, 3.1415, "Example", True, False, ("X", 3, None), ["Y", range(10)]]

variable_dictionary = {"A": 0, "B": [True, False], 1: (None, 1), ("Key", True): ["Example", 3.1415]}

# True == 1 and False == 0, so the set keeps only one element from each pair
variable_set = {0, 1, 2, 3.1415, "Example", True, False, ("X", 3, None), ("Y", range(10))}
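
The differences in mutability and hashability described above can be checked with a short sketch. All variable names here are illustrative only.

```python
# A tuple's slots cannot be reassigned...
point = (1, 2, ["a"])
tuple_is_immutable = False
try:
    point[0] = 99
except TypeError:
    tuple_is_immutable = True

# ...but a mutable object stored inside a slot can still change.
point[2].append("b")

# A list can grow and be reordered in place.
values = [3, 1, 2]
values.append(4)
values.sort()

# Dictionary keys must be hashable: a tuple works, a list does not.
lookup = {("x", 1): "tuple key"}
list_key_rejected = False
try:
    lookup[["x", 1]] = "list key"
except TypeError:
    list_key_rejected = True

# A set keeps one element per distinct value; 1, True, and 1.0 compare equal.
unique = {0, 1, True, False, 1.0}

print(point, values, len(unique))
```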

There are also several useful built-in functions for working with sequences, including enumerate, sorted, zip, and reversed. Enumeration is often used to track the index of each value while iterating over a sequence in a for-loop. Sorting creates a new sorted list from the values in an object, in ascending order by default (numerically for numbers and lexicographically for strings). Zipping pairs up the values of a number of objects into tuples, with the number of tuples determined by the shortest object; in Python 3 the result is an iterator, which can be passed to list to produce a list of tuples. Reversing creates an iterator over the values of an object in reverse order. Comprehensions can also be helpful for concisely forming new objects (a list, dictionary, or set) by filtering the values of a collection and performing an operation on them. Alternatively, the map function can be used in a similar manner to comprehensions, although it has no built-in capability for filtering.
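
A brief sketch of these built-in functions follows, using two small made-up lists. The results are wrapped in list where a function returns an iterator so that they can be printed.

```python
names = ["delta", "alpha", "charlie", "bravo"]
scores = [4, 1, 3]

# enumerate yields (index, value) pairs while iterating.
indexed = list(enumerate(names))

# sorted returns a new list; a key function changes the ordering criterion.
ordered = sorted(names)
by_length = sorted(names, key=len)

# zip stops at the shortest input, so the fourth name is left unpaired.
paired = list(zip(names, scores))

# reversed returns an iterator over the values in reverse order.
backwards = list(reversed(scores))

# map applies a function to every value but cannot filter on its own.
doubled = list(map(lambda score: score * 2, scores))

print(indexed, ordered, by_length, paired, backwards, doubled, sep="\n")
```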

Add links.

Examples of the formatting for a comprehension using a list, dictionary, and set:

variable_list = [value.upper() for value in collection if len(value) > 8]

variable_dictionary = {value: index for index, value in enumerate(collection) if value < 7}

# count requires the item to look for as an argument ("A" is a placeholder here)
variable_set = {value.count("A") for value in collection if value[0] == "A"}
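
The comprehension formats above can be tried with concrete data. The sample collections words and numbers are made up for this sketch, and the set comprehension collects word lengths as an assumed example of an operation that produces repeated values.

```python
words = ["Anaconda", "interpreter", "Anthology", "notebook", "Algorithms"]
numbers = [9, 4, 7, 2]

# List comprehension: keep words longer than 8 characters, in upper case.
long_words = [word.upper() for word in words if len(word) > 8]

# Dictionary comprehension: map each value below 7 to its position.
positions = {value: index for index, value in enumerate(numbers) if value < 7}

# Set comprehension: distinct lengths of the words starting with "A".
a_lengths = {len(word) for word in words if word[0] == "A"}

# map can apply the same operation, but filtering needs a separate step.
all_upper = list(map(str.upper, words))

print(long_words, positions, a_lengths, sep="\n")
```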

Virtual Environments

Since it is often necessary to use packages and modules which do not come with the standard library, it can be helpful to isolate the environment for a specific project, so that packages and modules can be pinned to specific versions. This can be done through a virtual environment, which is a self-contained directory tree containing an installation for a particular version of Python along with additional packages and modules. In other words, a virtual environment provides a cooperatively isolated runtime environment in which users and applications can install and upgrade packages and modules without interfering with the behaviour of other users and applications running on the same system. Activating a virtual environment prepends its executable directory (bin on Unix-like systems) to PATH, so that running python invokes the interpreter of the environment and installed scripts can be run without having to use their full paths.
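
From inside Python, one way to check whether the running interpreter belongs to a virtual environment is to compare sys.prefix with sys.base_prefix. The helper name in_virtual_environment is hypothetical, and the check describes the interpreter that is running rather than whether the activate script was sourced, so treat it as a sketch under that assumption.

```python
import os
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # In a virtual environment, sys.prefix points at the environment directory,
    # while sys.base_prefix points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


# After activation, the first PATH entry is usually the environment's bin directory.
first_path_entry = os.environ.get("PATH", "").split(os.pathsep)[0]

print("Interpreter:", sys.executable)
print("Virtual environment:", in_virtual_environment())
print("First PATH entry:", first_path_entry)
```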

Create a virtual environment for a project with isolated packages and modules (conventionally in .venv):

~ $ python3 -m venv path/to/Environment

Activate a virtual environment as the runtime environment to be used for the project:

~ $ source path/to/Environment/bin/activate

Deactivate the virtual environment which is currently active:

(Environment) ~ $ deactivate

Upgrade the version of Python of a virtual environment (requires re-install of packages and modules):

~ $ python3 -m venv --upgrade path/to/Environment

Once activated, packages and modules will only be installed relative to the virtual environment. For convenience, a requirements list can be included at the root of the project as ./requirements.txt. This file should contain the relevant packages and modules with their associated versions for the project. To upgrade a package or module, the version can be updated in the requirements list and re-installed to implement the changes. If the packages and modules have already been installed for a project, a requirements list can simply be created from these existing packages and modules.

Example of a requirements list with popular packages and modules:

				# Web Framework
				Flask==2.0.2

				# Database ORM
				SQLAlchemy<=1.4.25

				# Data Analysis
				pandas==1.3.3
				numpy==1.21.2

				# Data Visualization
				matplotlib==3.4.3
				seaborn==0.11.2

				# Machine Learning
				scikit-learn==0.24.2
				tensorflow==2.7.0
				torch==1.9.1

				# Authentication
				bcrypt>=3.2.0
				pyjwt>=2.3.0

				# Testing
				pytest==6.2.4
				coverage==6.2

Install the packages and modules included in the requirements list as requirements.txt:

				(Environment) ~ $ python -m pip install --requirement requirements.txt

				(Environment) ~ $ python -m pip install --requirement requirements.txt --force-reinstall

Create a requirements list from the packages and modules currently installed in the virtual environment:

				(Environment) ~ $ python -m pip freeze > requirements.txt
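
As a rough illustration of what a requirements list captures, the sketch below enumerates the distributions visible to the running interpreter with the standard importlib.metadata module and formats them as pinned lines. The helper name pinned_requirements is hypothetical, and this is a simplified approximation rather than a description of how pip freeze works internally (for example, it does not handle editable or URL-based installs).

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' lines for the installed distributions."""
    lines = set()
    for distribution in metadata.distributions():
        name = distribution.metadata.get("Name")
        if name:  # skip entries with missing or broken metadata
            lines.add(f"{name}=={distribution.version}")
    return sorted(lines, key=str.lower)


for line in pinned_requirements():
    print(line)
```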

If Spyder is used for development, one option is to activate the virtual environment and then install Spyder through pip. Spyder can then be launched from within the virtual environment, after which the interpreter for IPython needs to be edited to point to the particular version of Python for the virtual environment. Alternatively, a modular approach avoids installing a version of Spyder in each virtual environment and allows for flexibility and configurability: Spyder is installed in the base environment, the necessary kernels are installed in the virtual environment, and the path to the interpreter for IPython is edited in preferences to point to the particular version of Python for the virtual environment (although this will need to be edited each time the virtual environment is changed).

If Visual Studio Code is used for development, the path to the interpreter for IPython can be edited in preferences to point to the particular version of Python for the virtual environment (although this will need to be edited each time the virtual environment is changed). Conveniently, this can be associated with the current workspace folder.

Install Spyder in the base environment and then use a modular approach for each virtual environment:

				~ $ source path/to/Environment/bin/activate
				(Environment) ~ $ pip install spyder-kernels
				(Environment) ~ $ python3 -c "import sys; print(sys.executable)"
				(Environment) ~ $ deactivate
				Preferences > Python Interpreter > Use The Following Interpreter

Install Visual Studio Code in the base environment and then use a modular approach for each virtual environment:

				~ $ source path/to/Environment/bin/activate
				(Environment) ~ $ python3 -c "import sys; print(sys.executable)"
				(Environment) ~ $ deactivate
				Command Palette > Python: Select Interpreter